README for the replication package of the paper "A Case Study on Causal Models as Support for Prototyping of Software with Machine Learning"

# Folder Structure

Project_Name/
│
├── data/							// datasets used for the experiment
│   ├── Basic_dataset.csv
│   ├── CDD_based_dataset.csv
│   └── External_dataset.csv
│
├── trained_models/
│   ├── fnn_basic_no_fft_model_9.h5	           // trained models used in publication
│   ├── fnn_basic_with_fft_model_9.h5
│   ├── fnn_CDD_no_fft_model_9.h5
│   ├── fnn_CDD_with_fft_model_9.h5
│   ├── fnn_with_fft.txt                        // inference results
│   └── fnn_no_fft.txt                        
│
├── models/							// notebooks for model training and comparison
│   ├── FNN_with_FFT.ipynb
│   └── FNN_without_FFT.ipynb
│
├── cross_validation/					// training/validation curves used to check for overfitting
│   ├── Basic_without_fft_epoch_100.png	      // file name: dataset, with or without FFT, number of epochs
│   ├── CDD_with_fft_epoch_300.png
│   ├── CDD_fnn_without_fft_100.png
│   └── Basic_with_fft_epoch_100.png
│
├── workshops/						// outcome of the workshops conducted in phase 1 & 2
│   ├── model_arc_detection.svg
│   ├── model_motor_diagnostics.svg
│   ├── workshop_1_20221017.pdf
│   ├── workshop_2_20221118.pdf
│   ├── workshop_3_20230113.pdf
│   ├── workshop_4_20230120.pdf
│   └── workshop_5_20230127.pdf
│
├── bayesian_analysis.ipynb	           // code for significance testing
├── data_plotting.ipynb                  // code for data plotting
├── parsing_results.ipynb                // code for parsing inference results
├── environment_causality.yml            // conda environment file for model training and inference
└── environment_pymc.yml                 // conda environment file for Bayesian testing         

# Use Anaconda to create Python environments
Two environment files are provided. The causality environment (environment_causality.yml) is used to train the models and to run model inference. The PyMC environment (environment_pymc.yml) is used for the Bayesian significance analysis of the results.

# Dataset
Each dataset has the shape (number of samples, 161).
Each arc-generation experiment lasts 2 to 4 seconds. At a sampling rate of 16 kHz, each row holds 10 ms of data (160 samples), so each experiment yields 200 to 400 rows.
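
The row counts follow directly from the sampling rate and the window length. A minimal sketch of that arithmetic is given here; the constant and function names are illustrative and are not taken from the notebooks.

```python
SAMPLING_RATE_HZ = 16_000   # 16 kHz sampling rate
WINDOW_MS = 10              # each row covers 10 ms

# Samples per row: 16,000 samples/s * 0.010 s = 160
SAMPLES_PER_ROW = SAMPLING_RATE_HZ * WINDOW_MS // 1000


def rows_per_experiment(duration_s: float) -> int:
    """Number of 10 ms rows produced by an experiment of the given length."""
    return int(duration_s * 1000) // WINDOW_MS


print(SAMPLES_PER_ROW)          # 160
print(rows_per_experiment(2))   # 200
print(rows_per_experiment(4))   # 400
```

Adding the label column to the 160 samples gives the 161 columns of each dataset.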

Each row has 161 values: the first is the label, and the remaining 160 are the current samples for that 10 ms window.
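
A minimal sketch of splitting one row into its label and its samples, using only the standard library. This README does not state whether the CSV files have a header row or how the labels are encoded, so the sketch parses a small synthetic row instead of a real file; `split_row` is a hypothetical helper, not a function from the notebooks.

```python
import csv
import io

ROW_WIDTH = 161  # 1 label + 160 current samples


def split_row(row):
    """Split one dataset row into (label, samples).

    Assumes column 0 is the label and columns 1..160 are the
    current values for one 10 ms window.
    """
    if len(row) != ROW_WIDTH:
        raise ValueError(f"expected {ROW_WIDTH} values, got {len(row)}")
    label = row[0]                          # kept as text; encoding is unknown
    samples = [float(v) for v in row[1:]]   # 160 current values
    return label, samples


# Synthetic stand-in for one CSV line: label "1", then 160 zero-valued samples.
fake_csv = io.StringIO(",".join(["1"] + ["0.0"] * 160) + "\n")
for row in csv.reader(fake_csv):
    label, samples = split_row(row)
    print(label, len(samples))
```

To try this on a real file, `fake_csv` could be replaced with `open("data/Basic_dataset.csv", newline="")`, skipping the first line if it turns out to be a header.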

# Models
The models folder contains two notebooks with the functions needed to validate the causal model.

# cross_validation
This folder contains figures from training runs with a large number of epochs. They show how model accuracy on the validation set changes over the epochs. Some of them show overfitting at large epoch numbers.
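
One way to read such a curve is to locate the epoch at which validation accuracy peaks, since training beyond that point no longer improves results on held-out data. A minimal sketch follows, using made-up accuracy values and a hypothetical helper `best_epoch`; neither comes from the notebooks or the figures.

```python
def best_epoch(val_accuracy):
    """Return the 1-based epoch with the highest validation accuracy.

    Ties resolve to the earliest epoch, which favours shorter training.
    """
    if not val_accuracy:
        raise ValueError("val_accuracy must not be empty")
    best_index = max(range(len(val_accuracy)), key=lambda i: val_accuracy[i])
    return best_index + 1


# Illustrative values only: accuracy rises, peaks, then degrades.
history = [0.71, 0.80, 0.86, 0.88, 0.87, 0.85, 0.84]
print(best_epoch(history))  # 4
```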